What's new in Seastar - issue 3

I/O

file: call close() without the syscall thread #1695 introduces an optimization which attempts to avoid running close (the system call) on the system call thread when it is believed that the call will not block.

Blocking system calls are run on a dedicated system call thread in Seastar to avoid blocking the reactor. Generally it is assumed that any file system operation may block. However, in https://github.com/scylladb/seastar/issues/1689 it was observed that a large number of close calls were being made on read-only files.

Should closing a read-only file block? What about a file opened in read-write mode? The initial belief was that close would never block (for example, to flush pending data), but it turned out that some file systems, network file systems in particular, do flush on close. So the decision was made to run close directly on the reactor only for read-only files.
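As a rough illustration of that decision, the check could key off the access mode the file was opened with. This is a minimal sketch, not Seastar's code: `can_close_on_reactor` is a hypothetical helper, and it assumes that a file opened with `O_RDONLY` has nothing to flush on close.

```cpp
#include <fcntl.h>   // O_ACCMODE, O_RDONLY, O_WRONLY, O_RDWR

// Hypothetical helper, not part of Seastar's API: decide whether close()
// may run directly on the reactor thread, based on the open flags.
// Assumption: a file opened read-only has no dirty data to flush on close.
inline bool can_close_on_reactor(int open_flags) {
    // O_ACCMODE masks out everything except the access mode bits.
    return (open_flags & O_ACCMODE) == O_RDONLY;
}
```

Anything opened for writing would still go through the system call thread, which keeps the conservative behaviour for the cases where a flush on close is plausible.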

Pedantically, there is still a bit of risk here, since the blocking semantics of close depend on the file system and its implementation, which may change. So it must be fairly important to reduce the number of times system calls are scheduled to run off the reactor thread. Every little bit matters.

Apparently this is not a concern when using io_uring, which can put close onto the submission queue.

In an e-mail thread on seastar-dev Travis Downs asks: can Seastar provide backpressure to applications that generate I/O?

When applications submit I/O to Seastar, it is queued and later dispatched to the kernel. However, applications can generally submit as much I/O as they want, even if it won’t be dispatched to the kernel immediately. So when an application has a lot of data it could submit, it would be useful to apply backpressure further upstream. By holding off on submitting I/O to Seastar, less memory is used and, as Travis mentions, those I/Os can continue to be modified until the last second.

The conclusion is that there isn’t a solution right now, but the thread does contain a proposal.
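To make the idea of upstream backpressure concrete, here is a minimal sketch of an in-flight byte budget that an application could consult before handing I/O to Seastar. It is not a Seastar API and it is not the proposal from the thread; `io_budget` and its methods are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Hypothetical in-flight byte budget. Not a Seastar API and not the
// proposal from the mailing-list thread.
class io_budget {
    std::size_t _capacity;
    std::size_t _in_flight = 0;
public:
    explicit io_budget(std::size_t capacity) : _capacity(capacity) {}

    // Reserves `bytes` and returns true if the request fits in the budget.
    // On false the caller holds the I/O back and may keep modifying it.
    bool try_submit(std::size_t bytes) {
        if (bytes > _capacity - _in_flight) {
            return false;
        }
        _in_flight += bytes;
        return true;
    }

    // Called when a previously submitted request completes; `bytes` must
    // match the size that was passed to try_submit().
    void complete(std::size_t bytes) {
        _in_flight -= bytes;
    }

    std::size_t in_flight() const { return _in_flight; }
};
```

The hard part discussed in the thread is choosing the capacity so that it tracks what the I/O queue can actually dispatch, which this sketch leaves to the caller.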

Utilities

gate: track holders #1676 introduces extra tracking to Seastar’s gate utility which helps with debugging. Normally a gate tracks its members using a counter. This PR introduces an intrusive list which explicitly tracks each gate::holder instance. This allows the gate to perform checks, such as asserting that there are no active holders when the gate is moved or destroyed.

This tracking also enables the set of holders to be inspected. This could be useful in cases where it is taking a long time for a gate to be closed. There are no current APIs for inspecting the holders, but one could attach a debugger or use a core file to inspect this information using a tool like https://github.com/scylladb/scylladb/blob/master/scylla-gdb.py which has functions for inspecting Seastar data structures in memory.
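The difference between a plain counter and an explicit list of holders can be sketched with a toy gate. This is only an illustration under simplifying assumptions (single-threaded, no futures, no close operation); `toy_gate` and `toy_holder` are hypothetical names and do not mirror Seastar's implementation.

```cpp
#include <cassert>
#include <cstddef>

class toy_gate;

// Each holder is a node in an intrusive doubly linked list rooted in the gate.
class toy_holder {
    toy_gate* _gate;
    toy_holder* _prev = nullptr;
    toy_holder* _next = nullptr;
    friend class toy_gate;
public:
    explicit toy_holder(toy_gate& g);
    ~toy_holder();
    toy_holder(const toy_holder&) = delete;
    toy_holder& operator=(const toy_holder&) = delete;
};

class toy_gate {
    toy_holder* _head = nullptr;   // explicit list of live holders
    std::size_t _count = 0;        // the classic counter, kept alongside
    friend class toy_holder;
public:
    toy_gate() = default;
    toy_gate(const toy_gate&) = delete;
    toy_gate& operator=(const toy_gate&) = delete;

    // With the list available, destruction can check that no holder remains.
    ~toy_gate() { assert(_head == nullptr && _count == 0); }

    std::size_t count() const { return _count; }

    // Walk the list; a debugger script could do the same on a core file.
    std::size_t walk_holders() const {
        std::size_t n = 0;
        for (const toy_holder* h = _head; h != nullptr; h = h->_next) {
            ++n;
        }
        return n;
    }
};

// Link the new holder at the head of the gate's list.
inline toy_holder::toy_holder(toy_gate& g) : _gate(&g), _next(g._head) {
    if (g._head) { g._head->_prev = this; }
    g._head = this;
    ++g._count;
}

// Unlink this holder from the gate's list.
inline toy_holder::~toy_holder() {
    if (_prev) { _prev->_next = _next; } else { _gate->_head = _next; }
    if (_next) { _next->_prev = _prev; }
    --_gate->_count;
}
```

A counter alone can say that two holders are outstanding; the list additionally says which objects they are, which is the information a debugger session needs.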

log: report scheduling group along with shard id #1666 annotates each log message with the current scheduling group. Here is the sample output from the PR:

INFO  2023-06-15 16:23:35,396 seastar - Reactor backend: io_uring
INFO  2023-06-15 16:23:35,397 [shard  0:n/a ] seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.
INFO  2023-06-15 16:23:35,398 [shard  0:main] seastar - Created fair group io-queue-0 for 16 queues, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000, per tick grab 786432
INFO  2023-06-15 16:23:35,399 [shard  0:main] seastar - IO queue uses 0.75ms latency goal for device 0

In the snippet, “n/a” corresponds to log messages printed early on before task queues have been created.
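The prefix layout in the sample can be reproduced with a small formatter. This is a sketch inferred only from the sample output, not taken from Seastar's logger: `log_prefix` is a hypothetical function, and the field widths (shard id right-aligned to two characters, group name padded to four) are assumptions read off the lines above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical formatter, inferred only from the sample output: the shard
// id is right-aligned to width 2 and the scheduling group short name is
// left-aligned and padded to 4 characters.
inline std::string log_prefix(unsigned shard, const std::string& group_shortname) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "[shard %2u:%-4s]", shard, group_shortname.c_str());
    return buf;
}
```

Under these assumptions, `log_prefix(0, "n/a")` yields `[shard  0:n/a ]` and `log_prefix(0, "main")` yields `[shard  0:main]`, matching the two prefixes in the sample.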

The feature allows a shortname to be associated with each scheduling group that an application creates. If no shortname is given, then a heuristic is applied to derive a four-letter shortname from the full name of the scheduling group. For an example of its use in Scylla, see https://github.com/scylladb/scylladb/pull/15821.

shared_ptr: deprecate lw_shared_ptr operator=(T&&) #1715 removes a footgun from Seastar’s shared pointer implementation. The following move-assignment overload was added 9 years ago.

shared_ptr& operator=(T&& x) {
    this->~shared_ptr();
    new (this) shared_ptr(new data(std::move(x)));
    return *this;
}

This overload recently burned me when I mistakenly typed p = std::string(); (where p is a seastar::lw_shared_ptr<std::string>). What I had intended to write was *p = std::string(), expecting holders of copies of p to see the update. For what it’s worth, std::shared_ptr doesn’t implement this overload, and thus won’t allow the same mistake to compile.
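The effect of the mistake is easy to reproduce with a toy pointer type. The sketch below is not Seastar's implementation: `toy_ptr` is a hypothetical wrapper around std::shared_ptr that adds an operator=(T&&) with the same rebinding behaviour, purely to show how the two spellings differ.

```cpp
#include <memory>
#include <string>
#include <utility>

// Toy pointer, not Seastar's implementation: wraps std::shared_ptr and adds
// an operator=(T&&) overload similar in effect to the deprecated one.
template <typename T>
class toy_ptr {
    std::shared_ptr<T> _p;
public:
    explicit toy_ptr(T value) : _p(std::make_shared<T>(std::move(value))) {}
    toy_ptr(const toy_ptr&) = default;
    toy_ptr& operator=(const toy_ptr&) = default;

    // The footgun: assigning a T rebinds this pointer to a fresh object
    // instead of updating the object shared with other copies.
    toy_ptr& operator=(T&& x) {
        _p = std::make_shared<T>(std::move(x));
        return *this;
    }

    T& operator*() const { return *_p; }
};
```

With `toy_ptr<std::string> p{std::string("old")}` and a copy `q = p`, the statement `p = std::string("new")` leaves `*q` equal to "old", because p now points at a different object. Writing `*p = std::string("new")` instead updates the shared string, so `*q` observes "new".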

Memory

Prefault memory when --lock-memory 1 is specified #1702 improves latencies in some cases when using transparent huge pages. Locking memory is important to avoid swapping, which can introduce tail latencies. Normally, when memory is locked, a fast page fault occurs the first time a page is touched by the application. However, as detailed in the PR, it was observed that with transparent huge pages there are cases where a defragmentation process may perform work at the time a page is faulted, resulting in additional observed latency.

This PR addresses the issue by spinning up a number of threads when Seastar starts that attempt to pre-fault memory to avoid, or at least reduce, this problem. The idea is that the memory has already been faulted in by the time the application touches the memory region for the first time.

The PR has some interesting bits about avoiding compiler optimizations by using inline assembly, as well as how the faulting threads are mapped to NUMA domains without using a fixed CPU affinity, to avoid as much contention as possible with the reactor threads.
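The core of pre-faulting is simply touching every page once. The sketch below shows that step in isolation, under stated assumptions: `prefault` is a hypothetical helper, it runs on a single thread rather than the per-NUMA-node threads the PR uses, it assumes `base` is page aligned, and it relies on a volatile access instead of the inline assembly mentioned above to keep the compiler from removing the touch.

```cpp
#include <cstddef>
#include <unistd.h>   // sysconf, _SC_PAGESIZE

// Hypothetical helper: touch one byte in every page of [base, base + len)
// and return the number of pages touched. Assumes `base` is page aligned.
inline std::size_t prefault(void* base, std::size_t len) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    volatile unsigned char* p = static_cast<volatile unsigned char*>(base);
    std::size_t touched = 0;
    for (std::size_t off = 0; off < len; off += page) {
        // Volatile load followed by a volatile store of the same value:
        // the contents are unchanged, but the compiler cannot drop the
        // access, so the page is faulted in for writing.
        p[off] = p[off];
        ++touched;
    }
    return touched;
}
```

Writing back the value that was just read matters because the region may already be in use by the time a background thread reaches it; a real implementation would also need to make that write safe against concurrent users of the same page, which this sketch does not attempt.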

memory: disable transparent hugepages if --overprovisioned is specified #1796 disables transparent huge pages when a Seastar application is started with the command line flag --overprovisioned. This flag is used to indicate that the application should “play nice” with the rest of the system, and THP is not always considered a team player.

Reactor

Add a stall detector histogram #1675 adds histogram output to the reactor metrics that track reactor stalls.